{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Graphing network packets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook relies on HoloViews 1.11 or above, since it uses the `opts` and `dim` utilities. Run `conda install -c pyviz holoviews` to install it. The notebook also imports datashader, NetworkX, Dask and colorcet."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data comes from a publicly available network forensics repository: http://www.netresec.com/?page=PcapFiles. The selected file is https://download.netresec.com/pcap/maccdc-2012/maccdc2012_00000.pcap.gz. After decompressing it, `tcpdump` converts the capture to one line of text per packet, and `grep` keeps only the TCP packets:\n",
    "\n",
    "```\n",
    "tcpdump -qns 0 -r maccdc2012_00000.pcap | grep tcp > maccdc2012_00000.txt\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a sample of the resulting output:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380\n",
    "09:30:07.780000 IP 192.168.24.100.1038 > 192.168.202.68.8080: tcp 0\n",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380\n",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380\n",
    "09:30:07.780000 IP 192.168.27.100.37877 > 192.168.204.45.41936: tcp 0\n",
    "09:30:07.780000 IP 192.168.24.100.1038 > 192.168.202.68.8080: tcp 0\n",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380\n",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380\n",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380\n",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Network traffic is directional and each node communicates over many ports, so we simplify the graph by treating traffic between nodes as undirected and ignoring the distinction between ports. Each edge is weighted by the total number of bytes exchanged between its two nodes in either direction."
   ]
  },
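  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The aggregation described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the contents of `pcap_to_parquet.py`: the helper names `host` and `aggregate` are hypothetical, and the sketch assumes each line has the seven-field shape of the sample output.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "\n",
    "def host(endpoint):\n",
    "    # Drop the trailing ':' and the port: '192.168.24.100.1038:' -> '192.168.24.100'\n",
    "    return endpoint.rstrip(':').rsplit('.', 1)[0]\n",
    "\n",
    "\n",
    "def aggregate(lines):\n",
    "    # Sum the byte counts for each unordered (host, host) pair\n",
    "    weights = defaultdict(int)\n",
    "    for line in lines:\n",
    "        parts = line.split()\n",
    "        # Assumed shape: time, 'IP', source, '>', target, protocol, byte count\n",
    "        if len(parts) != 7 or parts[1] != 'IP' or parts[3] != '>' or not parts[6].isdigit():\n",
    "            continue\n",
    "        pair = tuple(sorted((host(parts[2]), host(parts[4]))))\n",
    "        weights[pair] += int(parts[6])\n",
    "    return dict(weights)\n",
    "\n",
    "\n",
    "sample = [\n",
    "    '09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380',\n",
    "    '09:30:07.780000 IP 192.168.24.100.1038 > 192.168.202.68.8080: tcp 0',\n",
    "    '09:30:07.780000 IP 192.168.27.100.37877 > 192.168.204.45.41936: tcp 0',\n",
    "]\n",
    "print(aggregate(sample))\n",
    "```\n",
    "\n",
    "Sorting the two host addresses gives both directions of a conversation the same key, which is what makes the resulting edges undirected."
   ]
  },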
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "python pcap_to_parquet.py maccdc2012_00000.txt\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting output will be two Parquet dataframes, `maccdc2012_nodes.parq` and `maccdc2012_edges.parq`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import holoviews as hv\n",
    "from holoviews import opts, dim\n",
    "import networkx as nx\n",
    "import dask.dataframe as dd\n",
    "\n",
    "from holoviews.operation.datashader import (\n",
    "    datashade, dynspread, directly_connect_edges, bundle_graph, stack\n",
    ")\n",
    "from holoviews.element.graphs import layout_nodes\n",
    "from datashader.layout import random_layout\n",
    "from colorcet import fire\n",
    "\n",
    "hv.extension('bokeh')\n",
    "\n",
    "keywords = dict(bgcolor='black', width=800, height=800, xaxis=None, yaxis=None)\n",
    "opts.defaults(opts.Graph(**keywords), opts.Nodes(**keywords), opts.RGB(**keywords))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "edges_df = dd.read_parquet('../data/maccdc2012_full_edges.parq').compute()\n",
    "edges_df = edges_df.reset_index(drop=True)\n",
    "graph = hv.Graph(edges_df)\n",
    "len(edges_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Edge bundling & layouts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Datashader and HoloViews support a number of graph layouts, including circular, ForceAtlas2 and random layouts. Since large graphs with thousands of edges can become quite messy when plotted, datashader also provides functionality to bundle the edges."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Circular layout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the HoloViews ``Graph`` object lays out nodes in a circle. Once we have declared the ``Graph`` object, we can simply apply the ``bundle_graph`` operation to it. We also overlay the nodes on the datashaded graph, letting us identify each node by hovering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "opts.defaults(opts.Nodes(size=5, padding=0.1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "circular = bundle_graph(graph)\n",
    "datashade(circular, width=800, height=800) * circular.nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Force-directed layout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For other graph layouts, you can use the ``layout_nodes`` operation, supplying a datashader or NetworkX layout function. Here we use ``nx.spring_layout``, which is based on the [Fruchterman-Reingold](https://en.wikipedia.org/wiki/Force-directed_graph_drawing) force-directed algorithm (this is not ForceAtlas2, despite the ``forceatlas`` variable name used below). Instead of bundling the edges, we can also draw them as straight lines with the ``directly_connect_edges`` operation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forceatlas = directly_connect_edges(layout_nodes(graph, layout=nx.spring_layout))\n",
    "datashade(forceatlas, width=800, height=800) * forceatlas.nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Random layout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Datashader also provides a number of layout functions in case you don't want to depend on NetworkX:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "random = bundle_graph(layout_nodes(graph, layout=random_layout))\n",
    "datashade(random, width=800, height=800) * random.nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Showing nodes with active traffic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To highlight just the connections with active traffic, we apply ``select`` to the bundled graph, keeping only those edges with a weight of 10,000 bytes or more. By overlaying this sub-graph of high-traffic edges, we can take advantage of the interactive hover and tap features that Bokeh provides while still revealing the full datashaded graph in the background."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "overlay = datashade(circular, width=800, height=800) * circular.select(weight=(10000, None))\n",
    "overlay.opts(\n",
    "    opts.Graph(edge_line_color='white', edge_hover_line_color='blue', padding=0.1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Highlight TCP and UDP traffic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the same selection features, we can highlight TCP and UDP connections separately, again overlaying them on top of the full datashaded graph. The edges are revealed on hover, and because the highlighted nodes are drawn with an alpha level, nodes that have both TCP (blue) and UDP (red) connections appear in purple."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "udp_opts = opts.Graph(edge_hover_line_color='red', node_size=20,\n",
    "                      node_fill_color='red', edge_selection_line_color='red')\n",
    "tcp_opts = opts.Graph(edge_hover_line_color='blue',\n",
    "                      node_fill_color='blue', edge_selection_line_color='blue')\n",
    "\n",
    "udp = forceatlas.select(protocol='udp', weight=(10000, None)).opts(udp_opts)\n",
    "tcp = forceatlas.select(protocol='tcp', weight=(10000, None)).opts(tcp_opts)\n",
    "layout = datashade(forceatlas, width=800, height=800, normalization='log', cmap=['black', 'white']) * tcp * udp\n",
    "\n",
    "layout.opts(\n",
    "    opts.Graph(edge_alpha=0, edge_hover_alpha=0.5, edge_nonselection_alpha=0, inspection_policy='edges',\n",
    "               node_size=8, node_alpha=0.5, edge_color=dim('weight')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Coloring by protocol"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we have already seen, we can easily apply selections to ``Graph`` objects. We can use this to select by protocol, datashade the subgraph for each protocol with a different color, and finally stack the resulting datashaded layers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bokeh.palettes import Blues9, Reds9, Greens9\n",
    "ranges = dict(x_range=(-.5, 1.6), y_range=(-.5, 1.6), width=800, height=800)\n",
    "protocols = [('tcp', Blues9), ('udp', Reds9), ('icmp', Greens9)]\n",
    "shaded = hv.Overlay([datashade(forceatlas.select(protocol=p), cmap=cmap, **ranges)\n",
    "                     for p, cmap in protocols]).collate()\n",
    "stack(shaded * dynspread(datashade(forceatlas.nodes, cmap=['white'], **ranges)), link_inputs=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting the most targeted nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a bit of help from pandas, we can also extract the twenty most targeted nodes, meaning those that appear as the target of the most edges, and overlay them on top of the datashaded plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Node ids of the twenty nodes that appear as the target of the most edges\n",
    "target_counts = list(edges_df.groupby('target').count().sort_values('weight').iloc[-20:].index.values)\n",
    "# Datashaded edges and nodes in the background, the top targets as interactive nodes on top\n",
    "overlay = (datashade(forceatlas, cmap=fire[128:]) *\n",
    "           datashade(forceatlas.nodes, cmap=['cyan']) *\n",
    "           forceatlas.nodes.select(index=target_counts))\n",
    "\n",
    "overlay.opts(opts.Nodes(size=8), opts.RGB(width=800, height=800))"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
